Assignment

This assignment aims to solve the problems of Mini-Challenge 2.

LI NAN https://www.linkedin.com/in/li-nan-63b9251a6/
07-12-2021

1.Data Preparation

1.1 Global Settings

The global settings of the R code chunks in this post are as follows.
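The setup chunk itself is not shown in the rendered post; a typical configuration would look like this (an assumed sketch, not necessarily the author's exact options):

```r
# Assumed global knitr options: show code, suppress warnings and messages.
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```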

1.2 R Packages Installation

The following code checks for, installs, and loads the required R packages.

# !diagnostics off
packages = c('raster', 'sf', 'tmap', 'clock', 'DT', 'ggiraph', 'plotly', 'tidyverse', 'dplyr', 'readr', 'hrbrthemes', 'mapview')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

1.3 Data Import

The following code imports the raw data sets from Mini-Challenge 2 (“car_assignments.csv”, “cc_data.csv”, “gps.csv”, “loyalty_data.csv”).

credit_debit <- read_csv("data/cc_data.csv")
loyalty_data <- read_csv("data/loyalty_data.csv")
car_assignment <- read_csv("data/car_assignments.csv")
GPS <- read_csv("data/gps.csv")

2.Tasks and Questions for Mini-Challenge 2

2.1 Q1 Introduction

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

2.1.1 Data Preparation for Q1

Comparison of the total amounts between the credit/debit card and loyalty card data

To identify which locations are most popular and when, we need a data table listing the most popular places and their times.

# !diagnostics off
loyalty_data$count_event=1
credit_debit$count_event=1

aggregate_dataset <- loyalty_data %>% 
    group_by(timestamp,location) %>% 
    dplyr::summarize(Frequency = sum(count_event),Money_loyalty=sum(price))


credit_debit$timestamp <- strptime(credit_debit$timestamp, "%m/%d/%Y %H:%M")
aggregate_cc <- credit_debit %>% 
    group_by(timestamp,location) %>% 
    dplyr::summarize(Frequency = sum(count_event),Money_cd=sum(price))

Adjust the date type and create a new column named “Day”.

aggregate_dataset$timestamp <- as.Date(aggregate_dataset$timestamp, "%m/%d/%Y")

aggregate_dataset$Day <- format(aggregate_dataset$timestamp, format="%d")
aggregate_cc$Day <- format(aggregate_cc$timestamp, format="%d")

Aggregate the loyalty and credit/debit spending by day and location.

loyalty_money <- aggregate_dataset %>% group_by(Day,location) %>% dplyr::summarise(money_loyal=sum(Money_loyalty),freq_loyal=sum(Frequency))

cc_money <- aggregate_cc %>% group_by(Day,location) %>% dplyr::summarise(money_cc = sum(Money_cd),freq_cc=sum(Frequency))

Table 2.1 Combination of loyalty_money and cc_money

Comparison <- full_join(cc_money, loyalty_money, by = c('Day','location'))
Comparison[is.na(Comparison)] <- 0
Comparison$Money_dif=Comparison$money_cc-Comparison$money_loyal
Comparison$Freq_dif=Comparison$freq_cc-Comparison$freq_loyal

Comparison<-Comparison%>%
arrange(freq_cc)

datatable(Comparison,rownames = FALSE)

During data exploration, we can see five records that have no credit/debit card spending but do have loyalty card consumption records.

Table 2.2 Most popular location on each day
Result1 <- Comparison  %>%
dplyr::group_by(Day) %>%
filter(freq_cc == max(freq_cc)) %>%
arrange(desc(Day))

datatable(Result1,rownames = FALSE)

From the new data frame “Result1”, we can see that Katerina’s Cafe is the most popular place: it appears 6 times in the 14 days of records (Day 6 to Day 19).

To find more anomalies in the data, we need clearer data visualizations.

2.1.2 Data Visualization

Create new columns holding the tooltip text.

Comparison$Money_dif <- round(Comparison$Money_dif ,2)

Comparison <- Comparison %>%
  mutate(text = paste0("Location: ", location, "\n", "Day of January: ", Day, "\n", "Money Difference: ",Money_dif))



Comparison <- Comparison%>%
  mutate(text2 = paste0("Location: ", location, "\n", "Day of January: ", Day, "\n", "Frequency Difference: ",Freq_dif))

Heat map of money difference

p <- ggplot(data = Comparison, aes(x=Day, y=location,fill=Money_dif,text=text)) + 
  geom_tile() +
  geom_text(aes(label = Money_dif)) +
  scale_fill_gradient(low="pink", high="blue") +
  theme_ipsum()

p <- p + theme(axis.text.y = element_text(size = 8))

ggplotly(p, tooltip="text")

Figure 1 Money difference

Figure: Heat map of frequency difference

z <- ggplot(data = Comparison, aes(x=Day, y=location,fill=Freq_dif,text=text2)) + 
  geom_tile() +
  scale_fill_gradient(low="lightyellow", high="red") +
  geom_text(aes(label = Freq_dif))+
  theme_ipsum()

z <- z + theme(axis.text.y = element_text(size = 8))

ggplotly(z, tooltip="text2")

Figure 2 Frequency difference

2.1.3 Inference and Analysis

Comparing these two heat maps, we can see more anomalies: 1. Over these two weeks, every location except Maximum Iron and Steel (where both the money and frequency differences are 0) shows a difference on some days, either in frequency or in money.

2. At Frydos Auto Supply on Day 13, there is a large money difference of 9912.43, yet the frequency-difference heat map shows 0. Tracing back to the raw data through the DT table, the loyalty card total is 542.79 while the credit/debit card total is 10455.22, even though both have the same number of transaction records. This is quite strange.

3. Another anomaly appears in Table 2.1: there are five records where the credit/debit card spending is 0 but the loyalty card has consumption records. Among these five records, the most doubtful is Stewart and Sons Fabrication on Day 13, with a cost of 4071.95, which should also be noted.

2.2 Q2 Introduction

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

2.2.1 Data Preparation for Q2

To proceed with Q2, we manipulate the other two data sets, GPS and car_assignment.

Data manipulation for the car_assignment data set (create a full-name column)

car_assignment <-car_assignment %>% unite("Full Name", LastName:FirstName, remove = FALSE)
car_assign2 = subset(car_assignment, select = -c(LastName,FirstName) )
head(car_assign2)
# A tibble: 6 x 4
  `Full Name`     CarID CurrentEmploymentType  CurrentEmploymentTitle
  <chr>           <dbl> <chr>                  <chr>                 
1 Calixto_Nils        1 Information Technology IT Helpdesk           
2 Azada_Lars          2 Engineering            Engineer              
3 Balas_Felix         3 Engineering            Engineer              
4 Barranco_Ingrid     4 Executive              SVP/CFO               
5 Baza_Isak           5 Information Technology IT Technician         
6 Bergen_Linnea       6 Information Technology IT Group Manager      

Data manipulation to plot the route map

glimpse(GPS)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
bgmap <- raster("Data/MC2-tourist.tif")
bgmap
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1595, 2706, 4316070  (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05  (x, y)
extent     : 24.82419, 24.90976, 36.04499, 36.09543  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist.tif 
names      : MC2.tourist 
values     : 0, 255  (min, max)
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255)

Abila_st <- st_read(dsn = "Data/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `C:\linanyaogaibian\Dataviz_blog\_posts\2021-07-13-assignment\Data\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
GPS$Timestamp <- strptime(GPS$Timestamp, "%m/%d/%Y %H:%M:%S")
GPS$day <- as.factor(get_day(GPS$Timestamp))
GPS$id <- as_factor(GPS$id)
glimpse(GPS)
Rows: 685,169
Columns: 5
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id        <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ day       <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
GPS_sf <- st_as_sf(GPS, 
                   coords = c("long", "lat"),
                       crs= 4326)
GPS_sf
Simple feature collection with 685169 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 4
   Timestamp           id    day              geometry
 * <dttm>              <fct> <fct>         <POINT [°]>
 1 2014-01-06 06:28:01 35    6     (24.87469 36.07623)
 2 2014-01-06 06:28:01 35    6      (24.8746 36.07622)
 3 2014-01-06 06:28:03 35    6     (24.87444 36.07621)
 4 2014-01-06 06:28:05 35    6     (24.87425 36.07622)
 5 2014-01-06 06:28:06 35    6     (24.87417 36.07621)
 6 2014-01-06 06:28:07 35    6     (24.87406 36.07619)
 7 2014-01-06 06:28:09 35    6     (24.87391 36.07619)
 8 2014-01-06 06:28:10 35    6     (24.87381 36.07618)
 9 2014-01-06 06:28:11 35    6     (24.87374 36.07617)
10 2014-01-06 06:28:12 35    6     (24.87362 36.07618)
# ... with 685,159 more rows
gps_path <- GPS_sf %>%
  group_by(id,day) %>%
  summarize(m = mean(Timestamp), 
            do_union=FALSE) %>%
  st_cast("LINESTRING")
gps_path
Simple feature collection with 508 features and 3 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 508 x 4
# Groups:   id [40]
   id    day   m                                              geometry
   <fct> <fct> <dttm>                                 <LINESTRING [°]>
 1 1     6     2014-01-06 15:02:08 (24.88258 36.06646, 24.88259 36.06~
 2 1     7     2014-01-07 12:41:07 (24.87957 36.04803, 24.87957 36.04~
 3 1     8     2014-01-08 14:35:25 (24.88265 36.06643, 24.88266 36.06~
 4 1     9     2014-01-09 12:04:45 (24.88261 36.06646, 24.88257 36.06~
 5 1     10    2014-01-10 16:04:58 (24.88265 36.0665, 24.88261 36.066~
 6 1     11    2014-01-11 16:18:32 (24.88258 36.06651, 24.88246 36.06~
 7 1     12    2014-01-12 13:31:05 (24.88259 36.06643, 24.8824 36.066~
 8 1     13    2014-01-13 13:46:15 (24.88265 36.06642, 24.8826 36.066~
 9 1     14    2014-01-14 14:04:23 (24.88261 36.06644, 24.88262 36.06~
10 1     15    2014-01-15 15:33:54 (24.88263 36.06647, 24.88257 36.06~
# ... with 498 more rows

Delete orphan routes (paths consisting of a single point).

p = npts(gps_path, by_feature = TRUE)
gps_path2 <- cbind(gps_path, p)
gps_path2 <- dplyr::filter(gps_path2,p!=1)

From Q1, we noticed that Day 13 has a large money difference of 9912.43 at Frydos Auto Supply. To see who parked near Frydos Auto Supply and could have made the purchase there, we create a route map. It shows that the cars with IDs 12, 13, 15, 16, 20, and 34 passed through this place, so these drivers are suspicious.

gps_path_selected <- gps_path2 %>%
  group_by(id,day)%>%
  filter(day==13
         )
tmap_mode("view")
tm_shape(bgmap) + 
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines()

2.3 Q3 Introduction

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

Answer: To answer Q3, we need to build a relationship between the credit card / loyalty card owners and the car owners. This connection is based on matching where cars park with the transaction locations on the map. We need to find a logical time gap that can serve as evidence that a car was parked near a transaction site.

So, first of all, we restructure the GPS data set to compute the gap in minutes between consecutive GPS points, and build a box plot to find an appropriate time-gap threshold for parking.

GPS_track <- GPS_sf %>%
  dplyr::arrange(day, Timestamp) %>%
  group_by(id,day) %>%
  mutate(diff = Timestamp - lag(Timestamp),
         diff_mins = as.numeric(diff, units = 'mins'))
GPS_track$diff_mins <- round(GPS_track$diff_mins,2)
GPS_track <- GPS_track %>%
  mutate_at(vars(diff_mins), ~replace_na(., 0))

GPS_track$hour<- format(GPS_track$Timestamp, format="%H")

To decide which time gap is suitable to count as a parking period, we build a box plot of the diff_mins column.

boxplot1 <- ggplot(GPS_track, aes(x = "", y = diff_mins)) +
  geom_boxplot() +
  labs(title = "Distribution of time gap") +
  theme_classic()


ggplotly(boxplot1)

Since the box plot shows that more than three quarters of the diff_mins values are 0, it cannot by itself justify a parking-time threshold, but we do find one car’s movement is unusual, with a time gap of 1058.27 minutes. We therefore assume that data points with diff_mins > 5 are parking points. Based on this assumption, we build a new column called point.

GPS_track$point <- 0
GPS_track$point[GPS_track$diff_mins >5 ] <- 1
GPS_track$point[GPS_track$diff_mins <=5 ] <- 0
glimpse(GPS_track)
Rows: 685,169
Columns: 8
Groups: id, day [508]
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id        <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ day       <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ geometry  <POINT [°]> POINT (24.87469 36.07623), POINT (24.8746 36~
$ diff      <drtn> NA secs, 0 secs, 2 secs, 2 secs, 1 secs, 1 secs, ~
$ diff_mins <dbl> 0.00, 0.00, 0.03, 0.03, 0.02, 0.02, 0.03, 0.02, 0.~
$ hour      <chr> "06", "06", "06", "06", "06", "06", "06", "06", "0~
$ point     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
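
To sanity-check the parking-point rule, here is a toy GPS trace (timestamps invented for illustration; the 5-minute threshold is the assumption stated above):

```r
# Toy GPS trace for one car: three pings 2 minutes apart, then a long
# gap (a parking stop), then one more ping. Data invented for illustration.
library(dplyr)

toy <- data.frame(
  id = 1,
  Timestamp = as.POSIXct("2014-01-06 06:00:00", tz = "UTC") +
    60 * c(0, 2, 4, 400, 402)   # offsets in minutes
)

toy <- toy %>%
  group_by(id) %>%
  mutate(diff_mins = as.numeric(Timestamp - lag(Timestamp), units = "mins"),
         diff_mins = coalesce(diff_mins, 0),
         point     = as.integer(diff_mins > 5))   # 1 marks a parking point

sum(toy$point)   # only the single long gap is flagged
```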

Aggregate the parking points so that we know, for each driver and each day, how many places the car was parked.

Tracker_GPS <- aggregate(GPS_track$point, by=list(id=GPS_track$id,day=GPS_track$day,hour=GPS_track$hour), FUN=sum)
glimpse(Tracker_GPS)
Rows: 2,966
Columns: 4
$ id   <fct> 3, 19, 26, 29, 1, 28, 1, 16, 1, 15, 16, 1, 15, 24, 16, ~
$ day  <fct> 11, 11, 11, 12, 16, 19, 7, 7, 7, 7, 7, 9, 9, 9, 11, 11,~
$ hour <chr> "00", "00", "00", "00", "00", "00", "01", "01", "03", "~
$ x    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0~
glimpse(car_assign2)
Rows: 44
Columns: 4
$ `Full Name`            <chr> "Calixto_Nils", "Azada_Lars", "Balas_~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~

Build a data table named Car_owner that combines the employee information with the car drivers.

car_assign2$CarID <- as_factor(car_assign2$CarID)
Car_owner <- full_join(Tracker_GPS,car_assign2,  by = c("id" ="CarID"))
glimpse(loyalty_data)
Rows: 1,392
Columns: 5
$ timestamp   <chr> "1/8/2014", "1/8/2014", "1/14/2014", "1/9/2014",~
$ location    <chr> "Carlyle Chemical Inc.", "Carlyle Chemical Inc."~
$ price       <dbl> 4983.52, 4901.88, 4898.39, 4792.50, 4788.22, 474~
$ loyaltynum  <chr> "L8477", "L5756", "L2769", "L3317", "L8477", "L5~
$ count_event <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
credit_debit$hour<- format(credit_debit$timestamp, format="%H")

glimpse(credit_debit)
Rows: 1,490
Columns: 6
$ timestamp   <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-~
$ location    <chr> "Brew've Been Served", "Hallowed Grounds", "Brew~
$ price       <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.~
$ last4ccnum  <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, ~
$ count_event <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ hour        <chr> "07", "07", "07", "07", "07", "07", "07", "07", ~
loyalty_data$timestamp <- as.Date(loyalty_data$timestamp, "%m/%d/%Y")

loyalty_data$day <- as.factor(get_day(loyalty_data$timestamp))
loyalty_data$loyaltynum <- as.factor(loyalty_data$loyaltynum)
loyalty_people <- 
aggregate(loyalty_data$count_event, by=list(loyalnum=loyalty_data$loyaltynum,day=loyalty_data$day), FUN=sum)

loyalty_people$loyalnum <- as_factor(loyalty_people$loyalnum)
credit_debit$last4ccnum <- as.factor(credit_debit$last4ccnum)
credit_debit$day <- as.factor(get_day(credit_debit$timestamp))


cd_people <- 
aggregate(credit_debit$count_event, by=list(last4cnum=credit_debit$last4ccnum,day=credit_debit$day,hour=credit_debit$hour), FUN=sum)

cd_people$last4cnum <- as_factor(cd_people$last4cnum)
glimpse(cd_people)
Rows: 1,481
Columns: 4
$ last4cnum <fct> 8156, 5407, 3484, 8332, 9551, 9551, 2142, 2681, 34~
$ day       <fct> 12, 13, 19, 19, 19, 13, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ hour      <chr> "03", "03", "03", "03", "03", "06", "07", "07", "0~
$ x         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
glimpse(loyalty_people)
Rows: 606
Columns: 3
$ loyalnum <fct> L1107, L1485, L1682, L2070, L2169, L2247, L2343, L2~
$ day      <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~
$ x        <dbl> 2, 2, 3, 2, 3, 3, 3, 4, 1, 3, 2, 3, 3, 3, 3, 1, 3, ~
car_cl<-inner_join(cd_people,Car_owner,by=c("day","x","hour"))
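
The inner join above matches a card to a car whenever their per-day, per-hour event counts coincide, which is the weak link of the method: equal counts do not guarantee the same person. A toy sketch of the co-occurrence idea (all IDs and counts invented):

```r
# Toy co-occurrence match: a card and a car are paired when they have the
# same number of events in the same day and hour. Values invented.
library(dplyr)

cards <- data.frame(last4cnum = c("1111", "2222"),
                    day  = c("6", "6"),
                    hour = c("07", "08"),
                    x    = c(2, 1))

cars <- data.frame(id   = c("A", "B"),
                   day  = c("6", "6"),
                   hour = c("07", "09"),
                   x    = c(2, 3))

matched <- inner_join(cards, cars, by = c("day", "hour", "x"))
matched[, c("last4cnum", "id")]   # card 1111 pairs with car A
```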

2.4 Q5 Introduction

Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.

First suspicious activity: on Day 13 there is a large money difference of 9912.43 at Frydos Auto Supply, although the number of transaction records (4) is the same on both cards.

Second suspicious activity